Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial compact serialization proposal #131

Merged
merged 4 commits into from
Mar 23, 2019

Conversation

whyrusleeping
Copy link
Member

Review welcome, hoping to merge by Friday the 8th unless significant changes are needed.

data-structures.md Outdated Show resolved Hide resolved
| block v1 | 43 |
| message v1 | 44 |
| signedMessage v1 | 45 |
| actor v1 | 46 |
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach for actors would be to simply embed them directly in the HAMT (as we currently do).

Adding a tag prefix to actors only makes sense if they are made into separate ipld objects.

Thoughts welcome here too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think actors as their own objects makes more sense in the long run

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we shouldn't make requirements on block boundaries in IPLD dags. That is, users should be able to embed or reference by CID.

Unfortunately, we do this all the time anyways and this requirement is hard-coded everywhere...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stebalien yeah, ideally it wouldnt matter, but we live in the world where it unfortunately does :/

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

99% of the time I’m in the “no requirements on block boundaries” camp, but in the case of Filecoin it would concern me a little bit. allowing for a variations on block boundaries creates another vector to consider in performance and possibly even security. being rigid about the block format seems appropriate.

| Uint64 | CBOR major type 0 |
| BigInteger | [CBOR bignum](https://tools.ietf.org/html/rfc7049#section-2.4.2) |
| Address | CBOR major type 2 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing uint8 for completness

For example, a message would be encoded as:

```cbor
tag<44>[msg.To, msg.From, msg.Nonce, msg.Value, msg.Method, msg.Params]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about changes over time? would that result in a new tag? or should there be a version field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new tag

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that really worth the few bytes? The object's already going to be at least 55 bytes, likely much more. We can turn this into a map of numeric tags to fields with 5 additional bytes (a single byte key per field).

Really, it's not a huge deal but something to consider.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 5 bytes arent really worth it for the block, but they are for the messages, and I wanted to have a consistent format.

messages (with IDs) using this format will likely be under 20 bytes, and messages will take up the bulk of data being moved around. adding another 5 bytes is rather painful.

Also, does a map of numeric tags to fields really buy us that much? it ends up looking like a hacky array. (though i guess it does allow us to leave fields out)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under 20 bytes? Are you somehow skipping fields?

Also, does a map of numeric tags to fields really buy us that much? it ends up looking like a hacky array. (though i guess it does allow us to leave fields out)

Really, this just allow us to write /ipld/... paths. However, we can also just introduce a new pathing scheme, /filecoin/ for filecoin paths. However, this does mean that graphsync will need to support filecoin paths out of the box.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stebalien

To and From fields both around 4-5 bytes, nonce will be 1 byte most of the time, value will be a few bytes, up to 8 bytes in some bad cases, but hopefully around 2-3 bytes if we can help it. Method is one byte, and then params will be more. But a simple 'send money' transaction won't have any params.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stebalien also, with our own custom CID type, we can translate nice pathing schemes out of it. this is how we do pathing through other blockchain formats (and unixfs :/ )

| block v1 | 43 |
| message v1 | 44 |
| signedMessage v1 | 45 |
| actor v1 | 46 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think actors as their own objects makes more sense in the long run

data-structures.md Outdated Show resolved Hide resolved
@dignifiedquire
Copy link
Contributor

Missing a clear definition of the order in the array, also might be nice to look at the structs and byte align them, where possible/useful to 2 bytes.

@whyrusleeping
Copy link
Member Author

@dignifiedquire why would byte aligning them matter?

@dignifiedquire
Copy link
Contributor

@dignifiedquire why would byte aligning them matter?

I don't think it does in this context. There was just an inner voice that said I should think about byte alignment.

@whyrusleeping
Copy link
Member Author

@dignifiedquire I too have been touched in strange places by memory offsets... We should start a support group.

For example, a message would be encoded as:

```cbor
tag<44>[msg.To, msg.From, msg.Nonce, msg.Value, msg.Method, msg.Params]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that really worth the few bytes? The object's already going to be at least 55 bytes, likely much more. We can turn this into a map of numeric tags to fields with 5 additional bytes (a single byte key per field).

Really, it's not a huge deal but something to consider.

| block v1 | 43 |
| message v1 | 44 |
| signedMessage v1 | 45 |
| actor v1 | 46 |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we shouldn't make requirements on block boundaries in IPLD dags. That is, users should be able to embed or reference by CID.

Unfortunately, we do this all the time anyways and this requirement is hard-coded everywhere...


| object | tag |
|---|---|
| block v1 | 43 |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance we could use multicodecs? Too big?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could use multicodecs, but i was just trying to fit into the cbor tags table. Unclear if thats worth doing...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll +1 using multicodecs with the caveat that I’m not sure offhand what the difference in size would be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikeal Yeah, Using multicodecs would be nice. Though it moves us out of the realm of using cbor, and into using our own custom encoding format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that we already use cbor I would really like to stick with cbor tags, as this ensures cbor implementations are compatible out of the box and this is well defined cbor making other tools work well

Copy link
Contributor

@phritz phritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for writing this down. some comments/suggestions/questions below.

data-structures.md Show resolved Hide resolved
For example, a message would be encoded as:

```cbor
tag<44>[msg.To, msg.From, msg.Nonce, msg.Value, msg.Method, msg.Params]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm a little confused by the array notation in here, does it literally mean an major type 4 array to hold all the data items or does it denote just the sequence of data items? the difference is the presence or absence of major type 4 data item following the outer object tag, which contains the elements of the array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, major type 4 to hold all the data items, that way its a valid cbor object.

Datastructures in Filecoin are encoded as compactly as is reasonable. At a high level, each object is converted into an ordered array of its fields (ordered by their appearance in the struct declaration), then CBOR marshaled, and prepended with an object type tag.

| object | tag |
|---|---|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets have a very specific name for the left hand side that we use nowhere else so we can refer to things of that type unambiguously. is it object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about 'FCE type' for Filecoin Compact Encoding type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, anything that is unique like that works

Copy link
Member Author

@whyrusleeping whyrusleeping Feb 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Name this something


## HAMT
Datastructures in Filecoin are encoded as compactly as is reasonable. At a high level, each object is converted into an ordered array of its fields (ordered by their appearance in the struct declaration), then CBOR marshaled, and prepended with an object type tag.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we need to say that encodings must follow the rules for canonicalization per https://tools.ietf.org/html/rfc7049#section-3.9 and that decoders should use strict mode https://tools.ietf.org/html/rfc7049#section-3.10. also decoders should disable unused features like streaming for security's sake, and set explicit limits (suggestions?) on the size of byte arrays, lists etc as a precaution against dos.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely

Copy link
Member Author

@whyrusleeping whyrusleeping Feb 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would set limits on sizes of objects, canonicalize, strict decoding, and so on. i'll make a mention of that in the doc

  • TODO: this


## HAMT
Datastructures in Filecoin are encoded as compactly as is reasonable. At a high level, each object is converted into an ordered array of its fields (ordered by their appearance in the struct declaration), then CBOR marshaled, and prepended with an object type tag.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a goal to restrict the cbor data item types, eg to not encode to maps? if so we should say so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it the case that we will ever encode something to major type 5? if not and there is a design reason for this eg concern about suitability of key types across languages then we should say so. also if we are not going to use type 5 they can turn it off for safety.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for the filecoin compact encoding format, we can leave that out. Though for the general cbor ipld stuff, people will still want to use that (i.e. in actor storage).

For the purpose of this doc, i think we could explicitly say its not used and should invalidate any object

| block v1 | 43 |
| message v1 | 44 |
| signedMessage v1 | 45 |
| actor v1 | 46 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think we should leave this out until we have use for it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair


## HAMT
Datastructures in Filecoin are encoded as compactly as is reasonable. At a high level, each object is converted into an ordered array of its fields (ordered by their appearance in the struct declaration), then CBOR marshaled, and prepended with an object type tag.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should note that we cannot distinguish between unset/missing and zero values given this encoding (which is fine). the encoder must always encode every field is for every object. maybe that is obvious but doesnt hurt to be explicit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably worth pointing out. I did like that this scheme removed that ambiguity.


## HAMT
Datastructures in Filecoin are encoded as compactly as is reasonable. At a high level, each object is converted into an ordered array of its fields (ordered by their appearance in the struct declaration), then CBOR marshaled, and prepended with an object type tag.

Copy link
Contributor

@phritz phritz Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this encoding scheme doesn't allow for forward compatibility so for example if we have a new object version old binaries are not going to be able to read it. this as opposed a scheme where fields are tagged an old implementations can just ignore new ones it doesnt understand. this is a feature i'm guessing, for size efficiency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does allow for forward compatibility. Every object is tagged, if you change an object, you pick a new tag.

In general in Filecoin, old binaries will not be able to understand new protocol formats, and it won't make sense for them to try to do so.

though maybe i can see a case where a thing wants to try and read messages, and if we were to add a new field for some weird reason, it wouldnt be able to read the old fields. I don't consider than much of an issue though, as my main concern is that the old impl knows the object is a different type, even if it can't understand it.

## HAMT
Datastructures in Filecoin are encoded as compactly as is reasonable. At a high level, each object is converted into an ordered array of its fields (ordered by their appearance in the struct declaration), then CBOR marshaled, and prepended with an object type tag.

| object | tag |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this encoding scheme look for multiple versions in the code? do you get a types.BlockV1 or types.BlockV2 depending on what you decoded and then have ifs? do you get a types.Block that has fields that might or might not have what you expect and you if on the version which is surfaced maybe in an unserialized field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

practically speaking, a change in block format will happen at a fixed height, and all blocks created after that height will be the new format. disambiguating should be straightforward.

I'm not 100% sure how implementations will handle this, but i would expect a single block type with any new fields added, and a version field the code can check (or just a height check, since it will need to be aware of that anyways)

Datastructures in Filecoin are encoded as compactly as is reasonable. At a high level, each object is converted into an ordered array of its fields (ordered by their appearance in the struct declaration), then CBOR marshaled, and prepended with an object type tag.

| object | tag |
|---|---|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implementers will need a bunch of examples to ensure their implementation is working. uh, including us i suppose. also prolly a good idea to link to something that makes it easy to decode cbor to diagnostic output.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll link this somewhere too, but this tool is really great for debug printing cbor, or even converting it to json

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • TODO: add examples and link diagnostics tool.

@whyrusleeping whyrusleeping force-pushed the feat/compact-serialization branch from 43bb175 to d9f2e6f Compare February 26, 2019 22:40
@whyrusleeping
Copy link
Member Author

Chatted with @warpfork about this. Some strong preferences, and some tricky bits:

  • strong preference to use a map with a single byte key instead of a tag. Tags don't play nicely with type systems.
  • strong preference for us to not use our own CID multicodec

alternative approach to having a CID multicodec would be to set a convention of prefixing CIDs with context, like "/fil/QmFooBar". This seems passable to me, but not foolproof. Questions remain about how a normal ipld object would link to a filecoin object.

Hard constraint is that a given path (i.e. "/parent/parent/state/foo") should always be unambiguous. There are various ways to achieve this, picking a good one is hard.

Downsides of using a multicodec:

  • Everything else would need plugins to understand these objects, even if just to replicate them
  • at some point we run out of one-byte multicodecs and then all our CIDs grow an extra byte and i will not stand for it

Alternative proposal to a filecoin specific multicodec:
We define a new multicodec called "well known CBOR schema" (or something) that implies a 'typed' CBOR object. These objects should all be valid DAG-CBOR, but should have a tag (or other well known thingy) that tells us what schema to use. Without the schema, the object cannot be pathed through (as the schema would be required for interpreting paths), but the object can still be traversed for outgoing links, so replicators can still handle and process it.

| Uint8 | CBOR Major type 0 |
| []byte | CBOR Major type 2 |
| string | CBOR Major type 3 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing boolean values ( not yet used but a good thing to have around I hear)

data-structures.md Outdated Show resolved Hide resolved
@whyrusleeping
Copy link
Member Author

Alright, i'm gonna merge this today. Any opposition, please speak now or file an issue afterwards complaining.

@bvohaska bvohaska mentioned this pull request Jul 1, 2019
27 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants